Matrices and vectors are frequently needed for computation and are a convenient way to store and access data. A vector is most commonly represented as many rows with a single column (a column vector). A significant amount of work has gone into making computers very fast at matrix math. The tradeoff is commonly framed as 'more memory for faster calculation', but contemporary computing devices typically have enough memory to process matrices in chunks.
In Python's NumPy, vectors and matrices are referred to as arrays: a fixed-size collection of elements of the same type (integer, floating point number, string of characters, etc.). Underneath, NumPy arrays are implemented in C for greater efficiency.
Note that this is different from the Python list: lists are a built-in Python datatype, whereas arrays are objects made available by the Python package NumPy.
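To make the list/array distinction concrete, here is a minimal sketch. The variable names are illustrative only and are not used elsewhere in this notebook.

```python
import numpy as np

py_list = [1, 2, 3]
np_array = np.array([1, 2, 3])

# Multiplying a list repeats it; multiplying an array scales each element
list_times_two = py_list * 2      # [1, 2, 3, 1, 2, 3]
array_times_two = np_array * 2    # array([2, 4, 6])

# A list can mix types; an array converts its elements to one common type
mixed_list = [1, 2.5, "three"]
mixed_array = np.array([1, 2.5])  # both elements are stored as float64

print(list_times_two)
print(array_times_two)
print(mixed_array.dtype)
```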
Array restrictions:
The array is the basis of all (fast) scientific computing in Python. We need to have a solid foundation of what an array is, how to use it, and what it can do.
By the end of this file you should have seen simple examples of:
Further reading:
https://docs.scipy.org/doc/numpy-dev/user/numpy-for-matlab-users.html
In [1]:
# Python imports
import numpy as np
While both data types hold a sequence of values, arrays are stored more efficiently in memory and perform significantly better than Python lists for numeric work. They also bring a host of properties and syntax that make numeric operations more convenient.
In [2]:
l = 20000
test_list = list(range(l))
test_array = np.arange(l)
print(type(test_list))
print(type(test_array))
In [3]:
print(test_list[:300]) # Print the first 300 elements
# (more on indexing in a bit):
In [4]:
print(test_array)
In [5]:
%timeit [np.sqrt(i) for i in test_list]
In [6]:
%timeit np.sqrt(test_array)
If the output reads "10 loops, best of 3: [time]", the statement was run 10 times in a row (one set of loops), that set was repeated 3 times, and the reported time is the average per loop from the fastest of the 3 sets.
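The %timeit magic is only available in IPython and Jupyter. Outside of them, a similar comparison can be sketched with the standard library's timeit module. The names values_list, values_array, loop_time and vector_time are illustrative, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n = 20000
values_list = list(range(n))
values_array = np.arange(n)
runs = 5

# timeit.timeit runs the callable `number` times and returns the total seconds
loop_time = timeit.timeit(lambda: [np.sqrt(i) for i in values_list], number=runs)
vector_time = timeit.timeit(lambda: np.sqrt(values_array), number=runs)

print("Element-by-element loop:", loop_time / runs, "s per run")
print("Single call on the array:", vector_time / runs, "s per run")
```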
In [7]:
test_array = np.array([[1,2,3,4], [6,7,8,9]])
print(test_array)
Index arrays using square brackets, starting from zero and specifying row, column:
In [8]:
test_array[0,3]
Out[8]:
As with ordinary Python variables, you do not declare an array's type up front: NumPy infers it from the values used to create the array.
All elements of a NumPy array share the same type. To check the data type (dtype) enter:
In [9]:
test_array.dtype
Out[9]:
Different variable types use different amounts of memory and can have an effect on performance for very large arrays.
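As a rough illustration of how the dtype is inferred and how it affects memory use, consider the sketch below. The array names are made up for this example. The integer dtype NumPy picks by default depends on the platform, so only the float sizes are stated in the comments.

```python
import numpy as np

int_array = np.array([1, 2, 3])        # integers only -> an integer dtype
float_array = np.array([1, 2, 3.0])    # one float promotes every element to float64

small_array = np.zeros(1000, dtype=np.float32)   # 4 bytes per element
large_array = np.zeros(1000, dtype=np.float64)   # 8 bytes per element

print(int_array.dtype, float_array.dtype)
# nbytes is the memory taken up by the elements themselves
print(small_array.nbytes, large_array.nbytes)    # 4000 8000
```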
Changing the type of array is possible via:
In [10]:
test_array = test_array.astype('float64')
print(test_array)
In [11]:
# We can create arrays of boolean values too:
bool_array = np.array([[True, True, False,True],[False,False,True,False]])
print(bool_array)
We can replace values in an array:
In [12]:
test_array[0,3]=99 # Assign value directly
print(test_array)
Deleting values from an array is possible, but because of the way arrays are stored in memory, it usually makes sense to keep the array structure intact. Often 'nan' (not a number) or some sentinel value such as 0 or -1 is used to mark the entry instead.
Keep in mind that 'nan' only works for floating point arrays (test_array was converted to float64 above); integer arrays, for example, cannot store it:
In [13]:
test_array[0,3] = np.nan # np.nan is NumPy's floating point 'not a number' value
print(test_array)
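The restriction on 'nan' can be seen in a small sketch. The names float_array and int_array are illustrative, not part of the notebook's running example.

```python
import numpy as np

float_array = np.array([1.0, 2.0, 3.0])
float_array[0] = np.nan          # fine: float arrays can represent nan

int_array = np.array([1, 2, 3])
try:
    int_array[0] = np.nan        # integers have no nan representation
except ValueError as err:
    print("Could not store nan in an integer array:", err)

print(float_array)
print(int_array)                 # unchanged
```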
In [14]:
test_array[:,1] # Use the ':' to index along one dimension fully
Out[14]:
In [15]:
test_array[1,1:] # Adding a colon indexes the rest of the values
# (includes the numbered index)
Out[15]:
In [16]:
test_array[1,1:-1] # We can index relative to the first and last elements
Out[16]:
In [17]:
test_array[1,::2] # We can specify a step (here: every second element)
Out[17]:
In [18]:
test_array[1,1::-1] # We can get pretty fancy about it
# Index the second row, starting at the second element
# and stepping backwards to the first element.
Out[18]:
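The slices above all follow the pattern start:stop:step, where stop is excluded and a negative step walks backwards. A short sketch on a hypothetical 1-D array named row (not part of the notebook's running example) shows each piece in isolation:

```python
import numpy as np

row = np.array([10, 11, 12, 13, 14, 15])

print(row[1:4])     # indices 1, 2, 3 -> [11 12 13]
print(row[::2])     # every second element -> [10 12 14]
print(row[::-1])    # reversed -> [15 14 13 12 11 10]
print(row[4:1:-1])  # indices 4, 3, 2 -> [14 13 12]
print(row[1::-1])   # from index 1 back to the start -> [11 10]
```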
Logical Indexing
We can specify only the elements we want by using an array of True/False values:
In [19]:
test_array[bool_array] # Use our bool_array from earlier
Out[19]:
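In practice the True/False array is often built from a comparison rather than typed by hand. A minimal sketch, using the illustrative names values and mask:

```python
import numpy as np

values = np.array([[1, 2, 3, 4], [6, 7, 8, 9]])

mask = values > 3        # element-wise comparison gives a boolean array
selected = values[mask]  # 1-D array of the elements where mask is True

print(mask)
print(selected)          # [4 6 7 8 9]
```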
Using the isnan function in NumPy:
In [20]:
nans = np.isnan(test_array)
print(nans)
In [21]:
test_array[nans] = 4
print(test_array)
In [22]:
test_array_Vstacked = np.vstack((test_array, [1,2,3,4]))
print(test_array_Vstacked)
In [23]:
test_array_Hstacked = np.hstack((test_array, test_array))
print(test_array_Hstacked)
We can bring these dimensions back down to one via flatten:
In [24]:
test_array_Hstacked.flatten()
Out[24]:
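To keep track of what stacking and flattening do to an array, it can help to look at the shapes. The sketch below uses a stand-in array called base rather than the notebook's test_array. It also shows that flatten returns a copy.

```python
import numpy as np

base = np.array([[1, 2, 3, 4], [6, 7, 8, 9]])     # shape (2, 4)

stacked_rows = np.vstack((base, [1, 2, 3, 4]))    # adds a row    -> shape (3, 4)
stacked_cols = np.hstack((base, base))            # adds columns  -> shape (2, 8)
flat = stacked_cols.flatten()                     # one dimension -> shape (16,)

flat[0] = -1   # flatten returns a copy, so stacked_cols is not modified
print(stacked_rows.shape, stacked_cols.shape, flat.shape)
print(stacked_cols[0, 0])
```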
Caution: frequently appending to NumPy arrays is expensive. Arrays have a fixed size, so every append allocates an entirely new chunk of memory and copies the old array into it.
It's faster to 'preallocate' an array of the final size with placeholder values and simply populate it as the computation progresses.
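A minimal sketch of the two approaches, assuming the final size n is known in advance. The names grown and filled are illustrative.

```python
import numpy as np

n = 1000

# Growing with np.append: every call allocates a new array and copies the old data
grown = np.array([])
for i in range(n):
    grown = np.append(grown, i ** 2)

# Preallocating: one allocation up front, then fill in place
filled = np.empty(n)          # contents are arbitrary until assigned
for i in range(n):
    filled[i] = i ** 2

print(np.array_equal(grown, filled))
```

Both loops produce the same values; the difference is that the second one never reallocates. np.zeros(n) would work equally well if defined starting values are preferred.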
In [25]:
test_array
Out[25]:
In [26]:
print("The broadcasted array is: ", test_array[0,:])
test_array[0,:] * test_array
Out[26]:
However, if the shapes are not compatible, broadcasting fails:
In [27]:
print("The broadcasted array is: ", test_array[:,0])
#test_array[:,0] * test_array # Uncomment the line to see that the
# dimensions don't match
In [28]:
# Make use of the matrix transpose (also can use array.T)
np.transpose( test_array[:,0]*np.transpose(test_array) )
Out[28]:
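Writing the shapes out explicitly may make the rule clearer: NumPy lines shapes up from the last dimension, and each pair of sizes must be equal or one of them must be 1. The sketch below uses a stand-in array named grid. It shows np.newaxis as one alternative to the double transpose, not necessarily the preferred approach.

```python
import numpy as np

grid = np.array([[1.0, 2.0, 3.0, 4.0], [6.0, 7.0, 8.0, 9.0]])  # shape (2, 4)
first_row = grid[0, :]        # shape (4,): matches the last dimension of grid
first_col = grid[:, 0]        # shape (2,): does not match the last dimension

row_scaled = first_row * grid                  # (4,) with (2, 4) -> (2, 4)
col_scaled = first_col[:, np.newaxis] * grid   # (2, 1) with (2, 4) -> (2, 4)

try:
    first_col * grid          # (2,) with (2, 4) cannot be broadcast
except ValueError as err:
    print("Broadcast failed:", err)

# Same result as the double-transpose approach
via_transpose = np.transpose(first_col * np.transpose(grid))
print(np.array_equal(col_scaled, via_transpose))
```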
In [29]:
print("The original array is: ", test_array)
print("The transposed array is: ", np.transpose(test_array) )
# Alternatively, using the array's own transpose method:
print("The transposed array is: ", test_array.transpose() )
One of the most frequently used properties of arrays is the shape (the size along each dimension):
In [30]:
print("The original array dimensions are: ", test_array.shape)
print("The array transpose dimensions are: ", test_array.transpose().shape)
In [31]:
test_array2 = np.array([1,5,4,0,1])
print("The original array is: ", test_array2)
test_array3 = test_array2.sort() # sort() works in place and returns None, so test_array3 is not a sorted copy
print("The reassigned variable is None, not a sorted array: ", test_array3)
print("test_array2 after sort: ", test_array2)
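If a sorted copy is wanted and the original should stay untouched, np.sort can be used instead of the in-place method. A small sketch contrasting the two, using the illustrative names unsorted, sorted_copy and result:

```python
import numpy as np

unsorted = np.array([1, 5, 4, 0, 1])

sorted_copy = np.sort(unsorted)   # returns a new sorted array
print(sorted_copy)                # [0 1 1 4 5]
print(unsorted)                   # [1 5 4 0 1] (unchanged)

result = unsorted.sort()          # sorts in place and returns None
print(result)                     # None
print(unsorted)                   # [0 1 1 4 5]
```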